Do NLP and machine learning improve traditional readability formulas?

Authors

  • Thomas François
  • Eleni Miltsakaki
Abstract

Readability formulas are methods used to match texts with readers' reading level. Several methodological paradigms have previously been investigated in the field. The most popular paradigm dates back several decades and gave rise to well-known readability formulas such as the Flesch formula (among several others). This paper compares this approach (henceforth “classic”) with an emerging paradigm that uses sophisticated NLP-enabled features and machine learning techniques. Our experiments, carried out on a corpus of texts for French as a foreign language, yield four main results: (1) the new readability formula performed better than the “classic” formula; (2) “non-classic” features were slightly more informative than “classic” features; (3) modern machine learning algorithms did not improve the explanatory power of our readability model, but did allow better classification of new observations; and (4) combining “classic” and “non-classic” features resulted in a significant gain in performance.
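For readers unfamiliar with the “classic” paradigm, the sketch below shows the general shape of such a formula, using the Flesch Reading Ease score for English as an example. It is only an illustration: the constants are those of Flesch's English formula, not the French formula evaluated in the paper, and the function name and the choice to pass precomputed counts are assumptions made here for simplicity.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch Reading Ease for English text, computed from precomputed counts.

    Higher scores indicate easier text. Counting words, sentences, and
    syllables is left to the caller (hypothetical design choice).
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # Two surface features combined linearly: this is what makes it "classic".
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Example: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))
```

A formula of this kind relies on only two surface features (sentence length and word length), which is the contrast the abstract draws with NLP-enabled feature sets.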


Similar resources

Les apports du TAL à la lisibilité du français langue étrangère

This paper presents a set of experiments aiming to (1) assess the contribution of NLP to the specific issue of the readability of texts for French as a foreign language (FFL) readers and (2) propose a new readability formula for FFL. This new model relies on 46 textual features representative of the lexical, syntactic, and semantic levels as well as some of the specificities of the FFL conte...


Workshop Predicting and Improving Readability

Scott Crossley: Crowdsourcing text complexity models. The current study builds on work by De Clercq et al. (2014) and Crossley et al. (2017) by using crowdsourcing techniques to collect human ratings of text comprehension, processing, and familiarity across a large corpus comprising a diverse variety of topic domains (science, technology, and history). Pairwise comparisons among the ratings wer...


Text Readability Classification of Textbooks of a Low-Resource Language

There are many languages considered to be low-density languages, either because the population speaking the language is not very large, or because insufficient digitized text material is available in the language even though millions of people speak the language. Bangla is one of the latter ones. Readability classification is an important Natural Language Processing (NLP) application that can b...


Using the crowd for readability prediction

Inspired by previous work on crowdsourcing, we investigate two different methodologies to assess the readability of a wide variety of text material by implementing two assessment tools: a lightweight crowdsourcing tool that invites users to provide pairwise comparisons, and a more advanced version in which experts can rank a batch of texts based on readability. In order to validate this approach, r...


Automatic Construction of Large Readability Corpora

This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables like the feature inventory used in the resulting corpus. In a comparison between shallow and deeper features, the fo...




Publication date: 2012